
fix to_netcdf append bug (GH1215) #1609

Merged
merged 19 commits into from
Oct 25, 2017

Conversation

jhamman
Member

jhamman commented Oct 4, 2017

TODO:
Additional tests are needed to verify that this works for all the write backends and that the correct errors are raised when dims differ between writes.

@fmaussion
Member

Thanks @jhamman !

With this fix, existing variables will silently be ignored and won't be written, right? Maybe the expected behaviour (or specs) of the "append" option should be written somewhere in the docs and tested, to avoid future regressions like the one we had in #1215

@jhamman
Member Author

jhamman commented Oct 10, 2017

@fmaussion - you bring up a good point. There are two scenarios here.

  1. appending to a file with existing data variables
  2. appending to a file with existing coordinate variables

I'm wondering if we should disallow 1, in favor of being more explicit. So:

list_of_vars_to_append = ['var1', 'var2']
ds[list_of_vars_to_append].to_netcdf(filename, mode='a')
# if either var1 or var2 are in filename, raise an error?

as opposed to the current behavior, which would silently skip all vars already in filename and not in list_of_vars_to_append.

@fmaussion
Member

netCDF4 would overwrite in this situation, and I am also in favor of overwriting, as this could be quite a useful use case:

from netCDF4 import Dataset
with Dataset('test.nc', 'w', format='NETCDF4') as nc:
    nc.createDimension('lat', 1)
    nc.createVariable('lat', 'f4', ('lat',))
    nc['lat'][:] = 1
with Dataset("test.nc", "a") as nc:
    nc['lat'][:] = 2

@jhamman
Member Author

jhamman commented Oct 18, 2017

@fmaussion - I've updated the append logic slightly. I'm wondering what you think? This version more aggressively overwrites existing variables (data_vars and coords).

@fmaussion
Member

Thanks! I like it: simple and in accordance with netCDF4's silent overwriting. It would be cool to describe this behavior in the documentation somewhere, and maybe add a test that the data is correctly overwritten?
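A minimal sketch of such an overwrite test, assuming one of xarray's netCDF engines is installed (names and paths are illustrative, not the merged test code):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# hypothetical temporary file for the sketch
path = os.path.join(tempfile.mkdtemp(), 'overwrite.nc')

# write an initial dataset, then append a dataset reusing the same name
xr.Dataset({'var1': ('x', [1.0, 2.0, 3.0])}).to_netcdf(path, mode='w')
xr.Dataset({'var1': ('x', [4.0, 5.0, 6.0])}).to_netcdf(path, mode='a')

# with the append behavior discussed here, 'var1' is silently overwritten
with xr.open_dataset(path) as actual:
    values = actual['var1'].values.copy()
```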

@jhamman
Member Author

jhamman commented Oct 21, 2017

Question for those who are familiar with the scipy backend. I have a few failing tests here on the scipy backend and I'm not really sure what's going on. It seems like it could be related to the mmap feature in scipy.

@shoyer
Member

shoyer commented Oct 21, 2017

@jhamman I think this is something to do with string/bytes. The data gets written as Unicode strings but then read back in as bytes, which is obviously not ideal.

If you want to ignore this, like the current tests, you can use assert_allclose() which has a flag for ignoring string/bytes issues (yeah, I know it's a nasty hack). In the long run we need to figure out how to solve issues like #1638. It seems like Unidata/netcdf-c#402 is likely relevant.

doc/io.rst Outdated
@@ -176,6 +176,10 @@ for dealing with datasets too big to fit into memory. Instead, xarray integrates
with dask.array (see :ref:`dask`), which provides a fully featured engine for
streaming computation.

It is possible to append or overwrite netCDF variables using the ``mode='a'``
argument. When using this option, all variables in the dataset will be written
to the netCDF file, regardless if they exist in the original dataset.
Member

for clarity: in the original file
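The appending behavior described in this doc hunk could be illustrated with a short sketch like the following (variable names and path are hypothetical):

```python
import os
import tempfile

import xarray as xr

# hypothetical temporary file for the sketch
path = os.path.join(tempfile.mkdtemp(), 'append.nc')

# write one variable, then append a second one to the same file
xr.Dataset({'a': ('x', [1.0, 2.0])}).to_netcdf(path, mode='w')
xr.Dataset({'b': ('x', [3.0, 4.0])}).to_netcdf(path, mode='a')

# both variables now live in the file
with xr.open_dataset(path) as ds:
    names = sorted(ds.data_vars)
```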

@@ -963,7 +963,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
default format becomes NETCDF3_64BIT).
mode : {'w', 'a'}, optional
Write ('w') or append ('a') mode. If mode='w', any existing file at
this location will be overwritten.
this location will be overwritten. If mode='a', exisitng variables
will be over written.
Member

overwritten in one word?

data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a',
engine='scipy')
actual = open_dataset(tmp_file, engine='scipy')
Member

This is triggering a test failure on Windows, where you apparently can't open the same file twice.

data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a',
engine='netcdf4')
actual = open_dataset(tmp_file, autoclose=self.autoclose,
Member

Rather than allowing a clean-up failure, please close this file if possible (use a context manager). File descriptors are still a limited resource for our test suite.

@@ -823,6 +823,37 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={},
autoclose=self.autoclose, **open_kwargs) as ds:
yield ds

@contextlib.contextmanager
def roundtrip_append(self, data, save_kwargs={}, open_kwargs={},
Member

Can we put most of these tests in one of the base classes, so we don't separately repeat them for scipy, netcdf4 and h5netcdf?

target, source = self.prepare_variable(
name, v, check, unlimited_dims=unlimited_dims)
if (vn not in self.variables or
(getattr(self, '_mode', False) != 'a')):
Member

I don't think we need the explicit check for the mode here. With mode='w', the file should already be starting from scratch.

mode = 'a' if i > 0 else 'w'
data[[key]].to_netcdf(tmp_file, mode=mode, **save_kwargs)
with open_dataset(tmp_file,
autoclose=self.autoclose, **open_kwargs) as ds:
Member Author

@shoyer - I've put these tests in a base class now. I've never quite understood how these tests work though. Are the inherited classes passing open/save_kwargs that specify the engine? How does this work?

Member

Yes, that's exactly right.

One day it would be nice to split apart the gigantic test_backends.py into separate files for each backend...

data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a')
with open_dataset(tmp_file, autoclose=self.autoclose) as actual:
assert_identical(data, actual)
Member Author

Regarding the issue with Windows: what's the best way to do this then? Is it possible to explicitly close a netCDF file written with to_netcdf?

Member

to_netcdf() should write and close the file. You only need to use the context manager to ensure things get closed with open_dataset().
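In other words (a hedged sketch; the path and dataset names are illustrative):

```python
import os
import tempfile

import xarray as xr

# hypothetical temporary file for the sketch
path = os.path.join(tempfile.mkdtemp(), 'close.nc')

# to_netcdf() writes and closes the file before returning,
# so a follow-up append does not collide with a dangling handle
xr.Dataset({'var1': ('x', [1.0, 2.0])}).to_netcdf(path, mode='w')
xr.Dataset({'var2': ('x', [3.0, 4.0])}).to_netcdf(path, mode='a')

# only reads need explicit closing; the context manager handles it,
# which also avoids the double-open problem on Windows
with xr.open_dataset(path) as actual:
    n_vars = len(actual.data_vars)
```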

@jhamman
Member Author

jhamman commented Oct 24, 2017

@shoyer - ready for final review.

@@ -974,7 +974,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
default format becomes NETCDF3_64BIT).
mode : {'w', 'a'}, optional
Write ('w') or append ('a') mode. If mode='w', any existing file at
this location will be overwritten.
this location will be overwritten. If mode='a', exisitng variables
Member

exisitng -> existing

allow_cleanup_failure=allow_cleanup_failure) as tmp_file:
for i, key in enumerate(data.variables):
mode = 'a' if i > 0 else 'w'
data[[key]].to_netcdf(tmp_file, mode=mode, **save_kwargs)
Member

This is currently always using the default backend. You need to explicitly set engine here somewhere, though much of the logic can potentially live on the base class or in a helper function of some sort. There's no magic that sets this automatically for subclasses.

Take a look at how roundtrip() looks. I think I led you astray by suggesting that you move everything to the base class.

@@ -592,6 +627,8 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):

@requires_netCDF4
class BaseNetCDF4Test(CFEncodedDataTest):
engine = 'netcdf4'

Member Author

@shoyer, what do you think about this solution?

@@ -1024,6 +1030,8 @@ class ScipyFilePathTestAutocloseTrue(ScipyFilePathTest):

@requires_netCDF4
class NetCDF3ViaNetCDF4DataTest(CFEncodedDataTest, Only32BitTypes, TestCase):
engine = 'netcdf4'
Member

This isn't quite enough -- you also need to set the format in this case. Take a look at roundtrip() below.

Probably the cleanest fix would be refactor roundtrip() into three methods:

     # on the base class
     @contextlib.contextmanager
     def roundtrip(self, data, save_kwargs={}, open_kwargs={},
                   allow_cleanup_failure=False):
         with create_tmp_file(
                 allow_cleanup_failure=allow_cleanup_failure) as path:
             self.save(data, path, **save_kwargs)
             with self.open(path, **open_kwargs) as ds:
                 yield ds

     # on subclasses, e.g., for NetCDF3ViaNetCDF4DataTest
     def save(self, dataset, path, **kwargs):
         dataset.to_netcdf(path, format='NETCDF3_CLASSIC',
                           engine='netcdf4', **kwargs)

     @contextlib.contextmanager
     def open(self, path, **kwargs):
         with open_dataset(path, engine='netcdf4',
                           autoclose=self.autoclose, **kwargs) as ds:
             yield ds

Then you could write roundtrip_append() in terms of save and open.
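One way roundtrip_append() could then look, written here as a standalone sketch with module-level stand-ins for the proposed save()/open() methods (names are hypothetical, not the merged code; a subclass would pass its engine/format through these helpers):

```python
import contextlib
import os
import tempfile

import xarray as xr

# module-level stand-ins for the proposed save()/open() methods;
# a real subclass would fix engine/format here
def save(dataset, path, **kwargs):
    dataset.to_netcdf(path, **kwargs)

@contextlib.contextmanager
def open_(path, **kwargs):
    with xr.open_dataset(path, **kwargs) as ds:
        yield ds

@contextlib.contextmanager
def roundtrip_append(data, save_kwargs={}, open_kwargs={}):
    path = os.path.join(tempfile.mkdtemp(), 'tmp.nc')
    # write the first variable with mode='w', append the rest with mode='a'
    for i, key in enumerate(data.variables):
        mode = 'a' if i > 0 else 'w'
        save(data[[key]], path, mode=mode, **save_kwargs)
    with open_(path, **open_kwargs) as ds:
        yield ds
```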

@jhamman
Member Author

jhamman commented Oct 25, 2017

@shoyer - take another look. I have basically merged our two ideas and refactored the roundtrip tests. Tests are still failing, though not for py2.7, on AppVeyor, or for py3.6 locally.

@shoyer
Member

shoyer commented Oct 25, 2017

@jhamman This looks great, thank you!
